Significant hot hand effect in the game of cricket

We investigate the predictability and persistence of individual and team performance (the hot-hand effect) by analyzing the complete recorded history of international cricket. We introduce an original temporal representation of performance streaks, which can be modelled as a self-exciting point process. We confirm the presence of predictability and hot hands in individual performance and their absence in team performance and game outcomes. Thus, cricket is a game of skill for individuals and a game of chance for teams. Our study contributes to recent historiographical debates concerning the presence of persistence in individual and collective productivity and success. The metrics and methods we introduce can be used to test for and exploit clustering of performance in the study of human behavior and in the design of algorithms for predicting success.

• In both formats, the team performance can be considered as the aggregation of the participating individual performances.
• Individual performance
1. We use individual batting performances for our study.
2. An innings is one of the divisions of a cricket match during which one team takes its turn to bat.
3. We define an individual's performance as the total runs scored by that individual in an innings. In doing so, we add a water level of 1 to the runs, i.e., $S_j(t) = \mathrm{Run}_j(t) + 1$. This sets the smallest score to 1 and gives a well-defined performance fingerprint for every performance value. In other words, adding 1 removes the singularity associated with a score of zero runs in our performance fingerprint (1).
• Team performance
1. For both ODI and Test cricket, the performance of the team batting first determines the trajectory of the rest of the game. Hence, we only consider the batting performances of the team batting first for the quantification.
2. For the analysis, we only take into account games that had a definite outcome, i.e., a win or a loss for either team. We do not study games in which no team won.

(A) and (C) show the joint distribution of the relative difference between the indices of the second-best and the best performances, plotted against the relative difference between the indices of the third-best and the best performances in the dataset. (B) and (D) show the joint distribution of the same quantities, measured from randomly shuffled performance sequences in the dataset. The p-value from the two-dimensional two-sample Kolmogorov-Smirnov test [2,5] is reported in each case.

Figure S3: Hawkes point process along the colored noise performance time. (A) Kernel Density Estimation (KDE) of the branching ratios obtained from colored noise [9]; β is the exponent of the colored noise. The inset shows the median branching ratio plotted against the exponent of the underlying colored noise. (B) The Auto Correlation Function (ACF) of the colored noise for the given exponent values.
Figure S4: Robustness of R(∆t). We construct 500 null R(∆t) signals from random data and compare them with the obtained results to check the significance and robustness of those results. We first reshuffle the true data to obtain random data, which we treat as pseudo-true data. We then reshuffle this pseudo-true data to construct the corresponding null R(∆t) (see the Methods section). This gives a realisation of R(∆t) for a time series that contains no signal. We repeat this 500 times to obtain the confidence intervals. The red lines are the 500 realizations of the null R(∆t) signal. The blue line is the median of 500 true signals. The p-values around the center suggest that the signals are robust. The insets compare the two signals with their corresponding 95% confidence intervals.

We calibrate all the individual performances in ODI cricket using the ARIMA model. We conduct differencing tests (Kwiatkowski-Phillips-Schmidt-Shin, Augmented Dickey-Fuller and Phillips-Perron) to determine the order of differencing, d. We analyse the autocorrelation functions (ACF) and partial autocorrelation functions (PACF) and use the Canova-Hansen method to determine the optimal values of p and q for each model. Here p is the number of autoregressive terms, d is the number of nonseasonal differences needed for stationarity, and q is the number of lagged forecast errors in the prediction equation. We perform the ARIMA calibration 100 times on each performance sequence and on the randomly shuffled performance sequence (null), and record the log-likelihood scores. The main figure shows the Kernel Density Estimation (KDE) of the median log-likelihood scores obtained from the original and the shuffled performance sequences. The inset shows the KDE of the relative difference between the median log-likelihood scores obtained from the original performance sequence and from the shuffled null performance sequence. The ARIMA model is not able to identify the hot hand effect, as there are no significant differences between the data and the null.

Figure S6: Auto-correlation. We estimate the auto-correlation at a lag of 1 time step for consecutive player performances (data) and for randomly shuffled player performances (null). Panel A shows the kernel density of the auto-correlation values for the ODI format and panel B for the Test format. In both cases, the median values obtained from the data and from the null are close to 0. The auto-correlation metric does not allow us to identify the hot hand effect, as there are no significant differences between the data and the null.
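The lag-1 comparison between a performance sequence and its shuffled null can be sketched as follows. This is a minimal illustration, not the code used for the figure: the function names, the number of shuffles, and the example scores are all hypothetical, and the estimator shown is the standard sample autocorrelation.

```python
import random
from statistics import mean, median


def lag1_autocorrelation(seq):
    """Sample autocorrelation at lag 1 (returns 0.0 for a constant sequence)."""
    mu = mean(seq)
    denom = sum((x - mu) ** 2 for x in seq)
    if denom == 0:
        return 0.0
    num = sum((seq[i] - mu) * (seq[i + 1] - mu) for i in range(len(seq) - 1))
    return num / denom


def shuffled_null(seq, n_shuffles=200, seed=0):
    """Lag-1 autocorrelations of randomly shuffled copies of seq (the null)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = []
    for _ in range(n_shuffles):
        perm = list(seq)
        rng.shuffle(perm)
        values.append(lag1_autocorrelation(perm))
    return values


# Hypothetical innings scores for one player, already offset by +1.
scores = [1, 35, 12, 78, 5, 1, 46, 102, 23, 9, 60, 15]
observed = lag1_autocorrelation(scores)
null_median = median(shuffled_null(scores))
print(f"data: {observed:.3f}, null median: {null_median:.3f}")
```

Repeating this for every player and plotting the kernel density of the two sets of values would give a comparison of the kind described in the caption.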

Model comparison
• We perform the Shapiro-Wilk test on the original and null log-likelihood distributions for each career and find that the paired differences are not normally distributed.
• We thus perform the Wilcoxon signed-rank test to determine the significance of the observed results.
• We find that 223 out of 610 players (about 36.6%) have a significantly better median log-likelihood score (at the 5% significance level) for the original performance sequence than for the shuffled sequence. Thus the probability of falsely rejecting the null hypothesis is $< 10^{-6}$, and we conclude that there is some predictable pattern hidden inside the performance sequences.
• When we compare these results with the model proposed in the main text, we find that our model produces better predictions for 46.8% of the players, compared with 36.6% of the players for the ARIMA model.

Table S2: Prediction of team performance in Test format. We partition the performances into training and validation sets. We compute the log-likelihood scores (L) for the control and model forecasts 100 times and record the median values. Similarly, we estimate the branching ratio (n) on the original data and on the shuffled data (Null) 100 times and record the median values. We perform the Wilcoxon signed-rank test to determine the significance of the observations. The columns present, from left to right: the median branching ratio obtained from the data ($n_{\mathrm{Data}}$), the median branching ratio obtained from the randomly shuffled data ($n_{\mathrm{Null}}$), whether $n_{\mathrm{Data}} > n_{\mathrm{Null}}$, and the p-value; then the median log-likelihood score obtained from the model ($L_{\mathrm{Model}}$), the median log-likelihood score obtained from the control ($L_{\mathrm{Control}}$), whether $L_{\mathrm{Model}} > L_{\mathrm{Control}}$, and the p-value.
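One way to read the $< 10^{-6}$ bound in the model comparison is as a binomial tail: if the per-player tests were independent and no sequence carried any signal, each of the 610 tests would come out significant with probability 0.05. The sketch below evaluates that tail under exactly this independence assumption, which is ours and not stated in the text; the function names are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    top = max(logs)
    return math.exp(top) * sum(math.exp(v - top) for v in logs)


# Chance of seeing at least 223 significant players out of 610
# if every test had only a 5% false-positive rate.
tail = binom_upper_tail(223, 610, 0.05)
print(tail < 1e-6)
```

Under these assumptions only about 30 significant players would be expected by chance, so a count of 223 lies far in the tail.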
Hot winning hands

Multiple Hypothesis Testing
We face the problem of multiple hypothesis testing when we simultaneously consider multiple individual tests on the same dataset or on dependent datasets. A number of methods have been proposed to address this problem; they correct the error rates of the individual tests according to the number of simultaneously considered hypotheses and the p-values of the individual tests. Below is a list of the methods that we have implemented to test our hypotheses [6].

Sidak's Test
Let $p_1, \ldots, p_m$ be the p-values for the family of $m$ null hypotheses $H_1, \ldots, H_m$. If we set the family-wise alpha level to $\alpha$, then according to the test we reject all null hypotheses that have a p-value lower than $\alpha_{SID} = 1 - (1 - \alpha)^{1/m}$. This test produces a family-wise Type I error rate of exactly $\alpha$ when the tests are independent of each other and all null hypotheses are true. [7]

Holm's Test
Let $p_1, \ldots, p_m$ be the p-values for the family of $m$ null hypotheses $H_1, \ldots, H_m$, and let $p_{(1)}, \ldots, p_{(m)}$ denote the sorted values (from lowest to highest), with associated hypotheses $H_{(1)}, \ldots, H_{(m)}$. Then, for a given significance level $\alpha$, if $k$ is the minimal index such that $p_{(k)} > \frac{\alpha}{m+1-k}$, we reject the null hypotheses $H_{(1)}, \ldots, H_{(k-1)}$ and do not reject $H_{(k)}, \ldots, H_{(m)}$. If $k = 1$, we do not reject any of the null hypotheses. If no such $k$ exists, we reject all of the null hypotheses. This method ensures that FWER $\leq \alpha$, where FWER is the family-wise error rate. [4]

Hochberg's Test
Let $p_1, \ldots, p_m$ be the p-values for the family of $m$ null hypotheses $H_1, \ldots, H_m$, and let $p_{(1)}, \ldots, p_{(m)}$ denote the sorted values (from lowest to highest), with associated hypotheses $H_{(1)}, \ldots, H_{(m)}$. For a given $\alpha$, let $R$ be the largest $k$ such that $p_{(k)} \leq \frac{\alpha}{m-k+1}$. Then we reject the null hypotheses $H_{(1)}, \ldots, H_{(R)}$. [3]
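The three family-wise procedures above can be sketched directly from their definitions. This is a minimal illustration, assuming a non-empty list of p-values; the function names and the example p-values are hypothetical and are not taken from the study.

```python
def sidak_reject(pvals, alpha=0.05):
    """Reject H_i when p_i is below 1 - (1 - alpha)^(1/m)."""
    m = len(pvals)
    threshold = 1 - (1 - alpha) ** (1 / m)
    return [p < threshold for p in pvals]


def holm_reject(pvals, alpha=0.05):
    """Step-down: stop at the first sorted p_(k) above alpha / (m + 1 - k)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] > alpha / (m + 1 - rank):
            break
        reject[idx] = True
    return reject


def hochberg_reject(pvals, alpha=0.05):
    """Step-up: find the largest k with p_(k) <= alpha / (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (m - rank + 1):
            largest = rank
    reject = [False] * m
    for idx in order[:largest]:
        reject[idx] = True
    return reject


# Hypothetical p-values for four simultaneous tests.
example = [0.01, 0.04, 0.03, 0.005]
print(sidak_reject(example))
print(holm_reject(example))
print(hochberg_reject(example))
```

Each function returns one Boolean per hypothesis, in the original order of the p-values. On the same input, Hochberg's step-up rule rejects at least as many hypotheses as Holm's step-down rule.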
Bonferroni's Test
Let $p_1, \ldots, p_m$ be the p-values for the family of $m$ null hypotheses $H_1, \ldots, H_m$, where $m_0$ is the number of true null hypotheses. Let the family-wise error rate (FWER) be the probability of rejecting at least one true $H_i$, that is, of making at least one type I error. Then the Bonferroni correction rejects the null hypothesis for each $p_i \leq \frac{\alpha}{m}$, thereby controlling the FWER at $\leq \alpha$. [7]

Classic FDR Test
For a given $\alpha$, if $k$ is the largest value such that $p_{(k)} \leq \frac{k}{m}\alpha$, we reject the null hypothesis (i.e., declare discoveries) for all $H_{(i)}$ with $i = 1, \ldots, k$. [1,3]

Storey's Test
Let $T_1, \ldots, T_m$ be i.i.d. random variables representing the test statistics associated with $m$ tests of the null hypothesis $H_0$ versus an alternative hypothesis $H_1$, where $H_1$ and $H_0$ are true with probabilities $\pi_1$ and $\pi_0 = 1 - \pi_1$, respectively. We denote the critical region at significance level $\alpha$ by $\Gamma_\alpha$, i.e., the set of values of $T_i$ for which $H_0$ is rejected. Let an experiment yield a value $t$ for the test statistic. The q-value of $t$ is formally defined as
$$q(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \mathrm{pFDR}(\Gamma_\alpha).$$
That is, the q-value is the infimum of the pFDR if $H_0$ is rejected for test statistics with values $\geq t$. Equivalently, the q-value equals
$$\inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \Pr(H_0 \text{ is true} \mid T \in \Gamma_\alpha),$$
which is the infimum of the probability that $H_0$ is true given that $H_0$ is rejected (the false discovery rate). We use the q-value to reject or accept the null hypothesis. [8]

Winning Streaks in ODI Teams

Table S12: Winning streak statistics for team Zimbabwe in ODI format. In the table, $n$ denotes the length of the winning streak and $f$ is the corresponding frequency of occurrence. $p(n)$ and $p(n \mid f)$ are the p-values for observing streaks of length $n$ and streaks of length $n$ conditional on frequency $f$, respectively. Using the multiple hypothesis tests described in the previous section, we check the significance of the p-values and record the results. We write the result as True for a positive result, i.e., a significant p-value below the standard level of 0.05, and as False otherwise.
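The Bonferroni and classic FDR rules defined above, together with the True/False convention used in the streak tables, can be sketched as follows. The classic FDR rule is implemented here as the Benjamini-Hochberg step-up procedure, which matches the definition given; the function names and the example p-values are hypothetical and are not values from the tables.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def classic_fdr_reject(pvals, alpha=0.05):
    """Step-up FDR rule: largest k with p_(k) <= (k / m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            largest = rank
    reject = [False] * m
    for idx in order[:largest]:
        reject[idx] = True
    return reject


# Hypothetical p-values for streaks of four different lengths.
streak_pvals = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_reject(streak_pvals))   # one True/False entry per streak length
print(classic_fdr_reject(streak_pvals))
```

Each output entry corresponds to one row of a streak table: True marks a p-value that remains significant after the correction.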

Winning Streaks in Test Teams
Columns: Country, n, f, p(n), p(n | f), Sidak, Holm, Hochberg, Bonferroni, Classic FDR, Storey FDR, Storey's Q.